Introduction

This report analyzes a graduate-school admissions data set. I explore the relationships between GRE score, GPA, and undergraduate institution rank and whether an applicant was admitted, then build and compare several predictive models of admission.

Loading Libraries and Data - Start Here!

# Load required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::%||%()   masks base::%||%()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(tidyr)
library(performance)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(magrittr)
## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(vegan)
## Loading required package: permute
## Loading required package: lattice
# Data set
data <- read.csv("GradSchool_Admissions.csv")


# Display column names and summary
colnames(data)  
## [1] "admit" "gre"   "gpa"   "rank"
summary(data)
##      admit             gre             gpa             rank      
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
##  Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000
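The summary shows `admit` is a 0/1 outcome with mean 0.3175, so roughly 32% of the 400 applicants were admitted — an imbalance worth keeping in mind when judging model accuracy later. A quick check, assuming the `data` frame loaded above:

```r
# Class balance of the admit outcome (assumes `data` is already loaded)
table(data$admit)              # counts of rejected (0) vs admitted (1)
prop.table(table(data$admit))  # proportions; mean(admit) = 0.3175 per the summary
```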

Exploring the Data

Relationships between GRE, GPA, and Rank

# Build basic models
model_gre_admit <- glm(admit ~ gre, data = data, family = binomial)
summary(model_gre_admit)
## 
## Call:
## glm(formula = admit ~ gre, family = binomial, data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.901344   0.606038  -4.787 1.69e-06 ***
## gre          0.003582   0.000986   3.633  0.00028 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 486.06  on 398  degrees of freedom
## AIC: 490.06
## 
## Number of Fisher Scoring iterations: 4
model_gpa_admit <- glm(admit ~ gpa, data = data, family = binomial)
summary(model_gpa_admit)
## 
## Call:
## glm(formula = admit ~ gpa, family = binomial, data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.3576     1.0353  -4.209 2.57e-05 ***
## gpa           1.0511     0.2989   3.517 0.000437 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 486.97  on 398  degrees of freedom
## AIC: 490.97
## 
## Number of Fisher Scoring iterations: 4
model_rank_admit <- glm(admit ~ rank, data = data, family = binomial)
summary(model_rank_admit)
## 
## Call:
## glm(formula = admit ~ rank, family = binomial, data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.6366     0.3061   2.080   0.0375 *  
## rank         -0.5863     0.1240  -4.728 2.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 475.71  on 398  degrees of freedom
## AIC: 479.71
## 
## Number of Fisher Scoring iterations: 4
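The slopes above are on the log-odds scale; exponentiating converts them to odds ratios, which are easier to read. A small sketch using the coefficients copied from the three summaries above:

```r
# Slopes copied from the single-predictor model summaries above (log-odds scale)
b_gre  <- 0.003582
b_gpa  <- 1.0511
b_rank <- -0.5863

exp(b_gre)        # ~1.004: odds ratio per single GRE point
exp(100 * b_gre)  # ~1.43:  odds ratio per 100 GRE points
exp(b_gpa)        # ~2.86:  odds ratio per full GPA point
exp(b_rank)       # ~0.56:  odds multiplier per step down in rank
```

So a 100-point GRE bump multiplies the odds of admission by about 1.43, while each step down in school rank roughly halves them.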

Plotting GRE vs GPA and Correlation Test

Just want to see the plot here.

# Scatter plot of GRE vs GPA
plot(data$gre, data$gpa, xlab = "GRE Score", ylab = "GPA",
     main = "GRE vs GPA")

cor(data$gre, data$gpa)
## [1] 0.3842659
cor.test(data$gre, data$gpa)
## 
##  Pearson's product-moment correlation
## 
## data:  data$gre and data$gpa
## t = 8.3036, df = 398, p-value = 1.596e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2974203 0.4648047
## sample estimates:
##       cor 
## 0.3842659
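A correlation of ~0.38 sounds substantial, but squaring it shows GRE and GPA share only about 15% of their variance:

```r
# Shared variance implied by the correlation reported above
r <- 0.3842659
r^2  # ~0.148, i.e. the two scores mostly carry independent information
```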

Predicted Admission Probabilities

# Data frame for plotting predicted probabilities
plot_data <- data.frame(
  gre = data$gre,
  gpa = data$gpa,
  rank = data$rank,
  admit = data$admit,
  pred_gre = predict(model_gre_admit, type = "response"),
  pred_gpa = predict(model_gpa_admit, type = "response"),
  pred_rank = predict(model_rank_admit, type = "response"))

# Plotting predicted admission probability against GRE score
ggplot(plot_data, aes(x = gre, y = pred_gre)) +
  geom_line(color = "blue") +
  geom_point(aes(y = admit), alpha = 0.3) +
  labs(title = "Admission Probability vs GRE Score", x = "GRE Score", y = "Admission Probability") +
  theme_minimal()

That didn't turn out how I wanted; it didn't give me a lot of information. I decided to tweak it a bit and compare models instead.
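One way to make that plot more informative is to overlay observed admission rates in GRE bins on top of the fitted curve. A sketch, assuming `data`, `plot_data`, and the tidyverse from the chunks above (the 50-point bin width is an arbitrary choice):

```r
# Observed admission rate within 50-point GRE bins (assumes `data` is loaded)
binned <- data %>%
  mutate(gre_bin = round(gre / 50) * 50) %>%
  group_by(gre_bin) %>%
  summarise(obs_rate = mean(admit), n = n())

# Fitted curve plus binned observed rates, point size showing bin counts
ggplot(plot_data, aes(x = gre, y = pred_gre)) +
  geom_line(color = "blue") +
  geom_point(data = binned, aes(x = gre_bin, y = obs_rate, size = n),
             alpha = 0.6, inherit.aes = FALSE) +
  labs(title = "Observed vs Predicted Admission Rate by GRE",
       x = "GRE Score", y = "Admission Probability", size = "Applicants") +
  theme_minimal()
```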

Comparing Models

# Models with varying complexities
model_gre_gpa_admit <- glm(admit ~ gre + gpa, data = data, family = binomial)
model_full_admit <- glm(admit ~ gre + gpa + rank, data = data, family = binomial)

# Compare model performance (the plot method comes from the 'see' package)
comp <- compare_performance(model_gre_gpa_admit, model_full_admit) %>%
  plot()
print(comp)
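Since the two models are nested, a likelihood-ratio test is another way to ask whether adding `rank` is worth the extra parameter; a sketch, assuming the model objects fit above:

```r
# Does adding rank significantly improve fit? (nested models -> chi-squared LRT)
anova(model_gre_gpa_admit, model_full_admit, test = "Chisq")

# AIC comparison: lower is better
AIC(model_gre_gpa_admit, model_full_admit)
```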

I liked this more, but I wanted a 3D image, and I ended up needing help on this part.

gre_vals <- seq(min(data$gre), max(data$gre), length.out = 100)
gpa_vals <- seq(min(data$gpa), max(data$gpa), length.out = 100)
rank_levels <- unique(data$rank)

prediction_grid <- expand.grid(gre = gre_vals,
                               gpa = gpa_vals,
                               rank = rank_levels)

prediction_grid$predicted_prob <- predict(model_full_admit, newdata = prediction_grid, type = "response")

# 3D Plot using plotly
prediction_grid_rank1 <- prediction_grid[prediction_grid$rank == 1, ]

z_matrix <- acast(prediction_grid_rank1, gpa ~ gre, value.var = "predicted_prob")

plotly::plot_ly(
  x = gre_vals,
  y = gpa_vals,
  z = z_matrix
) %>%
  plotly::add_surface(colorscale = "Blues") %>%
  plotly::layout(
    title = list(text = "Predicted Admission Probability (GRE + GPA + Rank Model, Rank 1 Shown)"),
    scene = list(
      xaxis = list(title = "GRE"),
      yaxis = list(title = "GPA"),
      zaxis = list(title = "Admission Probability")
    )
  )
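For a static document, a filled 2D contour of the same surface is a useful fallback; a sketch using the `prediction_grid_rank1` data frame built above:

```r
# Static alternative to the 3D surface: filled contour for rank-1 schools
ggplot(prediction_grid_rank1, aes(x = gre, y = gpa, fill = predicted_prob)) +
  geom_raster() +
  geom_contour(aes(z = predicted_prob), color = "white", linewidth = 0.3) +
  labs(title = "Predicted Admission Probability (Rank 1)",
       x = "GRE Score", y = "GPA", fill = "P(admit)") +
  theme_minimal()
```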

Click the image and move it around!!

Although it's pretty cool, I am still just playing with predictions here.
Next I wanted to test the model's performance on held-out data!

Model performance on Test Data

# Train/test split
set.seed(123)
train_indices <- sample(1:nrow(data), size = 0.7 * nrow(data))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]

# Fit the interaction model
model_interact <- glm(admit ~ gre * gpa + rank, data = train_data, family = binomial)

# Predictions and accuracy
test_data$predicted_prob <- predict(model_interact, newdata = test_data, type = "response")
test_data$predicted_class <- ifelse(test_data$predicted_prob >= 0.5, 1, 0)

# Confusion matrix and accuracy
table(Predicted = test_data$predicted_class, Actual = test_data$admit)
##          Actual
## Predicted  0  1
##         0 78 23
##         1 11  8
mean(test_data$predicted_class == test_data$admit)
## [1] 0.7166667
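Accuracy alone hides an important detail here. Using the counts from the confusion matrix above, we can also compute sensitivity, specificity, and the accuracy of a trivial always-reject baseline:

```r
# Counts read off the confusion matrix above
TN <- 78  # predicted 0, actual 0
FN <- 23  # predicted 0, actual 1 (missed admits)
FP <- 11  # predicted 1, actual 0
TP <- 8   # predicted 1, actual 1
n  <- TN + FN + FP + TP

(TP + TN) / n    # accuracy    ~0.717
TP / (TP + FN)   # sensitivity ~0.258: most actual admits are missed
TN / (TN + FP)   # specificity ~0.876
(TN + FP) / n    # ~0.742: accuracy of always predicting "reject"
```

So while ~71.7% accuracy sounds decent, the model catches only about a quarter of the actual admits, and on this split always predicting rejection would score slightly higher.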


In the End:

I explored the data and the correlations between variables; GRE and GPA turned out to be only weakly correlated (r ≈ 0.38). I built logistic regression models of admission, starting from single predictors and working up to more complex models, and compared their performance, with several visualization options along the way. Finally, I split the data for validation: I trained the interaction model on one portion and tested its predictions against the held-out portion.

Finally: My final model correctly predicted admission decisions for ~71.7% of the held-out applicants, a reasonable result for a logistic regression with only three predictors.

However: that also means ~28.3% of decisions were misclassified. This could be due to noise in the data set, unmeasured factors, or inherent unpredictability in admissions decisions.

Thanks!

Bonus - Dimensionality Reduction - non-metric multidimensional scaling (NMDS)

# Select only numeric predictors (you can scale them if needed)
data_numeric <- data %>%
  select(gre, gpa, rank)

# NMDS using Euclidean distance on the numeric predictors
nmds_result <- metaMDS(data_numeric, distance = "euclidean", k = 2, trymax = 100)
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 2.149938e-18 
## Run 1 stress 0.0001522356 
## ... Procrustes: rmse 4.181156e-05  max resid 0.0003309861 
## ... Similar to previous best
## Run 2 stress 0.0001165621 
## ... Procrustes: rmse 3.056482e-05  max resid 0.0003244857 
## ... Similar to previous best
## Run 3 stress 0.04298563 
## Run 4 stress 6.841302e-05 
## ... Procrustes: rmse 1.507768e-05  max resid 0.0001428006 
## ... Similar to previous best
## Run 5 stress 0.02643435 
## Run 6 stress 0.04880098 
## Run 7 stress 0.03624875 
## Run 8 stress 0.0001140337 
## ... Procrustes: rmse 3.113174e-05  max resid 0.0001891792 
## ... Similar to previous best
## Run 9 stress 0.02513349 
## Run 10 stress 8.170448e-05 
## ... Procrustes: rmse 1.963389e-05  max resid 0.000189902 
## ... Similar to previous best
## Run 11 stress 0.02545743 
## Run 12 stress 0.0536762 
## Run 13 stress 9.656149e-05 
## ... Procrustes: rmse 2.356572e-05  max resid 0.0004593796 
## ... Similar to previous best
## Run 14 stress 0.0001233483 
## ... Procrustes: rmse 3.505467e-05  max resid 0.0002209537 
## ... Similar to previous best
## Run 15 stress 9.361161e-05 
## ... Procrustes: rmse 2.49622e-05  max resid 0.0001717053 
## ... Similar to previous best
## Run 16 stress 0.02545695 
## Run 17 stress 0.02643403 
## Run 18 stress 8.312258e-05 
## ... Procrustes: rmse 2.027429e-05  max resid 0.0002742561 
## ... Similar to previous best
## Run 19 stress 0.02513339 
## Run 20 stress 0.04298578 
## *** Best solution repeated 9 times
## Warning in metaMDS(data_numeric, distance = "euclidean", k = 2, trymax = 100):
## stress is (nearly) zero: you may have insufficient data
# Check stress (how good the fit is)
nmds_result$stress
## [1] 2.149938e-18
# Extract NMDS scores (coordinates in 2D space)
nmds_points <- as.data.frame(nmds_result$points)
nmds_points$admit <- as.factor(data$admit)

# Plot using ggplot

ggplot(nmds_points, aes(x = MDS1, y = MDS2, color = admit)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "NMDS of Grad School Admissions",
       x = "NMDS Dimension 1", y = "NMDS Dimension 2",
       color = "Admission Status") +
  theme_minimal()

Similar applicants clustering together would be good; two cleanly separated groups (admitted vs. not admitted) would be even better. I did not get the latter, though the plot does show at least the beginnings of some clustering. One caveat: the near-zero stress warning above means that with only three variables, two NMDS dimensions can reproduce the distances almost perfectly, so the very low stress here is not evidence of meaningful structure in the data.
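Given the near-zero stress, a simpler alternative for three numeric variables is plain PCA; a sketch, assuming `data_numeric` and `data` from the chunks above:

```r
# PCA on the scaled numeric predictors as a simpler 2D embedding
pca_result <- prcomp(data_numeric, scale. = TRUE)
summary(pca_result)  # proportion of variance captured by each component

pca_points <- as.data.frame(pca_result$x)
pca_points$admit <- as.factor(data$admit)

ggplot(pca_points, aes(x = PC1, y = PC2, color = admit)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "PCA of Grad School Admissions",
       x = "PC1", y = "PC2", color = "Admission Status") +
  theme_minimal()
```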